Abstract

Adaptive control problems are notoriously 
difficult to solve even in the presence of plant- 
specific controllers. One way to by-pass the 
intractable computation of the optimal pol- 
icy is to restate the adaptive control as the 
minimization of the relative entropy of a con- 
troller that ignores the true plant dynam- 
ics from an informed controller. The solu- 
tion is given by the Bayesian control rule — 
a set of equations characterizing a stochas- 
tic adaptive controller for the class of possi- 
ble plant dynamics. Here, the Bayesian con- 
trol rule is applied to derive BCR-MDP, a 
controller to solve undiscounted Markov de- 
cision processes with finite state and action 
spaces and unknown dynamics. In partic- 
ular, we derive a non-parametric conjugate 
prior distribution over the policy space that 
encapsulates the agent's whole relevant his- 
tory and we present a Gibbs sampler to draw 
random policies from this distribution. Pre- 
liminary results show that BCR-MDP suc- 
cessfully avoids sub-optimal limit cycles due 
to its built-in mechanism to balance explo- 
ration versus exploitation. 



1. Introduction 

Adaptive control problems, i.e. the design of con- 
trollers for plants with unknown dynamics, are no- 
toriously difficult. Even when the plant dynamics is 
known to belong to a particular class for which optimal 
controllers are available, constructing the correspond- 
ing optimal adaptive controller is in general intractable 
(Duff, 2002). Thus, virtually all of the effort of the
research community is centered around the development
of tractable approximations.

Recently, new formulations of the adaptive control 
problem that are based on the minimization of a rel- 
ative entropy criterion have attracted the interest of 
the reinforcement learning (RL) community. For ex- 
ample, it has been shown that a large class of opti- 
mal control problems can be solved very efficiently if 
the problem statement is reformulated as the mini- 
mization of the deviation of the dynamics of a con- 
trolled system from the uncontrolled system (Todorov, 
2006; 2009; Kappen et al., 2009). A similar approach 
minimizes the deviation of the causal input/output- 
relationship of a Bayesian mixture of controllers from 
the true controller, obtaining an explicit solution called 
the Bayesian control rule (Ortega & Braun, 2010). 
This control rule is particularly interesting because it 
leads to stochastic controllers that infer the optimal 
controller on-line by combining the plant-specific con- 
trollers, implicitly using the uncertainty of the dynam- 
ics to trade-off exploration versus exploitation. 

Markov decision processes (MDPs) with undis- 
counted/averaged rewards constitute an important 
problem class in RL that has been far less studied 
than their discounted counterpart. While discounted 
rewards are suitable in many applications, a wide va- 
riety of tasks — such as those found in control tasks 
where the optimal trajectory is a limit cycle, e.g. net- 
work load balancing, automatic assembly, queue man- 
agement and control of embedded systems — are more 
naturally stated in terms of optimizing the average 
reward. However, finding an optimal policy for the 
average reward function is significantly more difficult 
than the discounted reward. Unlike the discounted 
case, in undiscounted MDPs the Bellman optimality 
equations are strongly coupled and the effective hori- 
zon is unbounded. A systematic study in Mahadevan 
(1996) has shown that exploration plays a crucial role 
in undiscounted MDP algorithms, as insufficient explo- 
ration may lead to the convergence to a sub-optimal 
limit cycle. Several algorithms have been proposed for
undiscounted MDPs, most notably R-learning and its
variants (Schwartz, 1993; Singh, 1994), which are
inspired by Watkins' Q-learning (Watkins, 1989) and are
simple to implement; and E^3 (Kearns & Singh, 1998)
and R-max (Brafman & Tennenholtz, 2001), which are
advanced algorithms that attain near-optimal average
reward in polynomial time.

The aim of this paper is to demonstrate how the
Bayesian control rule can be used to solve adaptive
control problems, illustrating its generality and
conceptual simplicity. In particular, we consider
undiscounted MDPs with finite state and action spaces
and unknown dynamics. We derive an adaptive controller,
which we call BCR-MDP, that employs a conjugate prior
distribution over the policy space to concisely
encapsulate the agent's history and to infer the optimal
policy. Furthermore, we introduce a Gibbs sampler
implementing the controller.

2. Background 

2.1. Bayesian control rule 

Let $\mathcal{O}$ and $\mathcal{A}$ be two finite sets of symbols, where
the former is the set of inputs (observations) and
the latter the set of outputs (actions). Actions and
observations at time $t$ are denoted as $a_t \in \mathcal{A}$ and
$o_t \in \mathcal{O}$ respectively, and we use the shorthand
$a_{\le t} := a_1, a_2, \ldots, a_t$ and the like to simplify the
notation of strings. We assume that the interaction between the
controller and the plant proceeds in cycles $t = 1, 2, \ldots$
where in cycle $t$ the controller issues action $a_t$ and the
plant responds with an observation $o_t$. A controller
is defined as a probability distribution $P$ over the
input/output (I/O) stream, and it is fully characterized
by the conditional probabilities

$$P(a_t \mid a_{<t}, o_{<t}) \quad \text{and} \quad P(o_t \mid a_{\le t}, o_{<t})$$

representing the probabilities of emitting action $a_t$ and
collecting observation $o_t$ given the respective I/O
history. Similarly, a plant is defined as a probability
distribution $Q$ characterized by the conditional
probabilities

$$Q(o_t \mid a_{\le t}, o_{<t})$$

representing the probabilities of emitting observation
$o_t$ given the I/O history.

If the plant is known, i.e. if the conditional probabilities
$Q(o_t \mid a_{\le t}, o_{<t})$ are known, then the designer can
build a suitable controller by equating the observation
streams as $P(o_t \mid a_{\le t}, o_{<t}) = Q(o_t \mid a_{\le t}, o_{<t})$ and by
defining action probabilities $P(a_t \mid a_{<t}, o_{<t})$ such that
the resulting distribution $P$ maximizes a desired utility
criterion. We say that $P$ is tailored to $Q$. In
many cases the conditional probabilities $P(a_t \mid a_{<t}, o_{<t})$
will be deterministic, but there are situations (e.g.
in repeated games) where the designer might prefer
stochastic policies instead.

If the plant is unknown then one faces an adaptive control
problem. Assume we know that the plant $Q_\theta$ is
going to be drawn randomly from a set $\mathcal{Q} := \{Q_\theta\}_{\theta \in \Theta}$
of possible plants indexed by $\Theta$. Assume further we
have available a set of controllers $\mathcal{P} := \{P_\theta\}_{\theta \in \Theta}$, where
each $P_\theta$ is tailored to $Q_\theta$. How can we now construct a
controller $P$ such that its behavior is as close as possible
to the tailored controller $P_\theta$ under any realization
of $Q_\theta \in \mathcal{Q}$?

A naive approach would be to minimize the relative
entropy of the controller $P$ with respect to the true
controller $P_\theta$, averaged over all possible values of $\theta$.
However, this is syntactically incorrect. The important
observation made in Ortega & Braun (2010) is
that we do not want to minimize the deviation of $P$
from $P_\theta$, but the deviation of the causal I/O
dependencies in $P$ from the causal I/O dependencies in $P_\theta$.
Intuitively speaking, we do not want to predict actions
and observations, but to predict the observations
(effect) given actions (causes). More specifically, they
propose to minimize a set of (causal) divergences $C$
defined by

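$$C := \limsup_{t \to \infty} \sum_{\theta} P(\theta) \sum_{\tau=1}^{t} C_\tau$$

$$C_\tau := \sum_{o_{<\tau}} P_\theta(\hat{a}_{<\tau}, o_{<\tau}) \, C_\tau(\hat{a}_{<\tau}, o_{<\tau}) \tag{1}$$

$$C_\tau(h) := \sum_{a_\tau} \sum_{o_\tau} P_\theta(a_\tau, o_\tau \mid h) \log \frac{P_\theta(a_\tau, o_\tau \mid h)}{P(a_\tau, o_\tau \mid h)}$$

where $P(\theta)$ is the prior probability of $\theta \in \Theta$, $\hat{a}_\tau$
denotes an intervened (not observed) action at time $\tau$,
and $\hat{a}_1, \hat{a}_2, \hat{a}_3, \ldots$ is an arbitrary sequence of intervened
actions.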

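In Ortega & Braun (2010), it is shown that the controller
$P$ that minimizes $C$ in Equation (1) for any
sequence of intervened actions is given by the conditional
probabilities

$$P(a_t \mid \hat{a}_{<t}, o_{<t}) := \sum_{\theta} P_\theta(a_t \mid a_{<t}, o_{<t}) \, P(\theta \mid \hat{a}_{<t}, o_{<t})$$

$$P(o_t \mid \hat{a}_{\le t}, o_{<t}) := \sum_{\theta} P_\theta(o_t \mid a_{\le t}, o_{<t}) \, P(\theta \mid \hat{a}_{<t}, o_{<t}) \tag{2}$$

where

$$P(\theta \mid \hat{a}_{\le t}, o_{\le t}) := \frac{P_\theta(o_t \mid a_{\le t}, o_{<t}) \, P(\theta \mid \hat{a}_{<t}, o_{<t})}{\sum_{\theta'} P_{\theta'}(o_t \mid a_{\le t}, o_{<t}) \, P(\theta' \mid \hat{a}_{<t}, o_{<t})}. \tag{3}$$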

Equations (2) and (3) constitute the Bayesian control
rule. This result is obtained by using properties of
interventions using causal calculus (Pearl, 2000). It is
worth pointing out that the resulting controller is fully
defined in terms of its constituent controllers in $\mathcal{P}$. It
is customary to use the notation

$$P(a_t \mid \theta, a_{<t}, o_{<t}) := P_\theta(a_t \mid a_{<t}, o_{<t})$$
$$P(o_t \mid \theta, a_{\le t}, o_{<t}) := P_\theta(o_t \mid a_{\le t}, o_{<t}),$$

that is, treating the different controllers as hypotheses
of a Bayesian model. The resulting control law
is in general stochastic. Also, note that by construction,
an adaptive code for the I/O stream based on the
Bayesian control rule is optimal for the class of plants
considered (MacKay, 2003).
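
For a finite set of hypotheses, Equations (2) and (3) can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function names, the representation of each tailored controller as a callable `history -> action`, and the list-based posterior are all assumptions made for this example.

```python
import random


def posterior_update(weights, obs_likelihoods):
    """Equation (3): reweight each hypothesis by the likelihood of o_t.

    weights[i]         -- current posterior weight of hypothesis theta_i
    obs_likelihoods[i] -- P_theta_i(o_t | past), evaluated at the observed o_t
    Actions are interventions, so they never enter this update.
    """
    unnormalized = [w * l for w, l in zip(weights, obs_likelihoods)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under every hypothesis")
    return [u / total for u in unnormalized]


def sample_action(weights, policies, history, rng=random):
    """Equation (2), first line: draw theta from the posterior, then act
    with the controller tailored to that theta (hypothetical callables)."""
    idx = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return policies[idx](history)
```

Sampling the hypothesis first and then acting with its tailored controller yields actions distributed according to the mixture in Equation (2), which is why the resulting control law is stochastic even when every tailored controller is deterministic.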

2.2. MDPs 

Definitions. An MDP is defined as a tuple
$(\mathcal{X}, \mathcal{A}, T, r)$: $\mathcal{X}$ is the state space; $\mathcal{A}$ is the action
space; $T_a(x; x') = \Pr(x' \mid a, x)$ is the probability that
an action $a \in \mathcal{A}$ taken in state $x \in \mathcal{X}$ will lead to state
$x' \in \mathcal{X}$; and $r(x, a) \in \mathcal{R} := \mathbb{R}$ is the immediate
reward obtained in state $x \in \mathcal{X}$ and action $a \in \mathcal{A}$. The
interaction proceeds in time steps $t = 1, 2, \ldots$ where
at time $t$, action $a_t \in \mathcal{A}$ is issued in state $x_{t-1} \in \mathcal{X}$,
leading to a reward $r_t = r(x_{t-1}, a_t)$ and a new state
$x_t$ that starts the next time step $t + 1$. Hence, starting
from an initial state $x_0 \in \mathcal{X}$, an I/O sequence has the
form

$$x_0 \to a_1 \to (r_1, x_1) \to a_2 \to (r_2, x_2) \to \cdots \to a_{t-1} \to (r_{t-1}, x_{t-1}) \to a_t \to (r_t, x_t) \to \cdots$$

A stationary closed-loop control policy $\pi : \mathcal{X} \to \mathcal{A}$
assigns an action to each state. For MDPs there always
exists an optimal stationary deterministic policy
and thus one only needs to consider such policies.
For undiscounted MDPs, the goal is to find a policy
that maximizes the time-averaged reward $\frac{1}{t} \sum_{\tau=1}^{t} r_\tau$
as $t \to \infty$.

Bellman optimality equations. In undiscounted
MDPs the average reward per time step for a fixed
policy $\pi$ with initial state $x$ is defined as follows:
$\rho^\pi(x) = \lim_{t \to \infty} \mathbb{E}^\pi \left[ \frac{1}{t} \sum_{\tau=1}^{t} r_\tau \right]$. It can be shown
(Bertsekas, 1987) that $\rho^\pi(x) = \rho^\pi(x')$ for all $x, x' \in \mathcal{X}$
under the assumption that the Markov chain for policy
$\pi$ is ergodic. Here, we assume that the MDPs are
ergodic for all stationary policies. Following the
Q-notation of Watkins (1989), the optimal policy $\pi^*$ can
be characterized in terms of the optimal average reward
$\rho$ and the optimal relative Q-values $Q(x, a)$ for
each state-action pair $(x, a)$ that are solutions to the
following system of non-linear equations (Singh, 1994):
for any state $x \in \mathcal{X}$ and action $a \in \mathcal{A}$,

$$Q(x, a) + \rho = r(x, a) + \sum_{x' \in \mathcal{X}} \Pr(x' \mid x, a) \left[ \max_{a'} Q(x', a') \right] = r(x, a) + \mathbb{E}_{x'} \left[ \max_{a'} Q(x', a') \right]. \tag{4}$$

For this setup, the optimal policy is defined as
$\pi^*(x) := \arg\max_a Q(x, a)$ for any state $x \in \mathcal{X}$.
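
When the transition probabilities and rewards are known, the system in Equation (4) can be solved numerically. The sketch below uses relative value iteration, a standard technique that is not part of this paper; the nested-list layout `T[x][a][x']`, `r[x][a]`, the function names and the choice of reference state are assumptions of this example. It pins $\max_a Q(x_{\mathrm{ref}}, a) = 0$ to remove the additive freedom in the relative Q-values.

```python
def relative_value_iteration(T, r, iters=500, ref_state=0):
    """Iterate Q <- r + E[max Q] - rho until Equation (4) holds.

    T[x][a][y] -- probability of moving from x to y under action a
    r[x][a]    -- immediate reward
    Returns (rho, Q) with max_a Q[ref_state][a] == 0 at convergence.
    """
    n_states, n_actions = len(r), len(r[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    rho = 0.0
    for _ in range(iters):
        M = [max(row) for row in Q]  # M(x') = max_a' Q(x', a')
        H = [[r[x][a] + sum(T[x][a][y] * M[y] for y in range(n_states))
              for a in range(n_actions)] for x in range(n_states)]
        rho = max(H[ref_state])      # gain estimate read off the reference state
        Q = [[H[x][a] - rho for a in range(n_actions)] for x in range(n_states)]
    return rho, Q


def greedy_policy(Q):
    """pi*(x) = argmax_a Q(x, a)."""
    return [max(range(len(q)), key=lambda a: q[a]) for q in Q]
```

A fixed iteration count is used for brevity; a practical version would stop once the change in `Q` falls below a tolerance.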

3. Derivation of the Controller 

One can exploit the Bellman optimality equations
in (4) to define a space of optimal controllers. In
particular, any $\rho \in \mathbb{R}$ and collection of Q-values
$Q(x, a) \in \mathbb{R}$, where $x \in \mathcal{X}$ and $a \in \mathcal{A}$, characterize
an optimal controller. Hence, one can parameterize
the space of controllers with a vector $\theta \in \Theta := \mathbb{R}^{|\mathcal{X}||\mathcal{A}|+1}$
containing the average reward and all the Q-values.
To apply the Bayesian control rule, we need to derive
probabilistic models for actions and observations.

Noting that in cycle $t$ the controller issues an action
$a_t \in \mathcal{A}$ and receives a reward $r_t \in \mathcal{R}$ and a state $x_t \in \mathcal{X}$,
one can define the space of actions and observations
for the Bayesian control rule as $\mathcal{A}$ and $\mathcal{O} := \mathcal{R} \times \mathcal{X}$
respectively.

Let $x = x_{t-1}$, $a = a_t$, $r = r_t$ and $x' = x_t$. Given the
controller's parameter vector $\theta$, the only additional
information needed to apply the optimal policy is given
by the last state $x$. Hence, we impose the independence
property

$$P(a_t, o_t \mid \theta, a_{<t}, o_{<t}) = P(a, r, x' \mid \theta, x).$$

Furthermore, this can be decomposed as a product of
three conditional probabilities:

$$P(a, r, x' \mid \theta, x) = P(a \mid \theta, x) \, P(x' \mid \theta, x, a) \, P(r \mid \theta, x, a, x'). \tag{5}$$

The first term, i.e. the probability of action $a$ given
$\theta$ and $x$, is given by:

$$P(a \mid \theta, x) = P(a \mid \{Q(x, a')\}_{a' \in \mathcal{A}}) = \begin{cases} 1 & \text{if } a = \arg\max_{a'} Q(x, a') \\ 0 & \text{else,} \end{cases} \tag{6}$$

which is just the action taken by the optimal policy $\pi^*$
in state $x$.

For the second term in (5), i.e. the state transition
probabilities given the past interactions, we observe
that the average reward $\rho$ and the Q-values $Q(x, a)$
encoded in $\theta$ do not provide enough information to
encode the transition probabilities. Thus we conclude
that they are independent of the parameter, that is:

$$\text{for any } \theta, \theta' \in \Theta, \quad P(x' \mid \theta, x, a) = P(x' \mid \theta', x, a). \tag{7}$$

Finally we derive $P(r \mid \theta, x, a, x')$, i.e. the probabilities
of rewards given the past interactions and the next
state $x'$. Note that the reward function $r(x, a)$ is not
parameterized by $\theta$, thus we cannot know the exact
value of $r$ from $(\theta, x, a, x')$.

Let $\xi(x, a, x')$ be the mean instantaneous reward defined by

$$\xi(x, a, x') := Q(x, a) + \rho - \max_{a'} Q(x', a'). \tag{8}$$

This quantity represents the mean of the instantaneous
reward $r(x, a)$ as estimated indirectly using the
pre- and post-action Q-values. Indeed, it is seen from
Equation (4) that

$$r(x, a) = \xi(x, a, x') + \nu, \quad \text{where} \quad \nu := \max_{a'} Q(x', a') - \mathbb{E}\left[ \max_{a'} Q(x', a') \mid x, a \right]. \tag{9}$$

Here, the term $\nu$ is a deviation from $r(x, a)$ that can
be interpreted as random observation noise. Assuming
that $\nu$ can be reasonably approximated by a normal
distribution $\mathcal{N}(0, 1/p)$ with precision $p$, we can
write down a likelihood model for the immediate reward
$r$ using the Q-values and the average reward, i.e.

$$P(r \mid \theta, x, a, x') = \sqrt{\frac{p}{2\pi}} \exp\left\{ -\frac{p}{2} \left( r - \xi(x, a, x') \right)^2 \right\}. \tag{10}$$

This completes our model of the controller with parameter
vector $\theta$.
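
To make the likelihood model concrete, Equations (8) and (10) can be written as two small functions. This is only an illustrative sketch: the nested-list layout `Q[x][a]` and the function names are assumptions of this example, and the log-density is used instead of the density for numerical convenience.

```python
import math


def xi(Q, rho, x, a, x_next):
    """Equation (8): mean instantaneous reward implied by theta = (rho, Q)."""
    return Q[x][a] + rho - max(Q[x_next])


def reward_log_likelihood(r, Q, rho, x, a, x_next, precision=1.0):
    """Logarithm of Equation (10): normal density with mean xi and variance 1/p."""
    mean = xi(Q, rho, x, a, x_next)
    return (0.5 * math.log(precision / (2.0 * math.pi))
            - 0.5 * precision * (r - mean) ** 2)
```

Parameter vectors whose implied mean $\xi(x, a, x')$ lies close to the observed reward receive a higher likelihood, which is the only way the observations inform the posterior over $\theta$.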

To apply the Bayesian control rule over the controllers
in $\Theta$, the intervened posterior distribution
$P(\theta \mid \hat{a}_{\le t}, o_{\le t})$ defined in Equation (3) needs to be
computed. Fortunately, due to the simplicity of the
likelihood model, one can easily devise a conjugate prior
distribution.

Inserting the likelihood into Equation (3), one obtains

$$P(\theta \mid \hat{a}_{\le t}, o_{\le t}) = \frac{P(x' \mid \theta, x, a) \, P(r \mid \theta, x, a, x') \, P(\theta \mid \hat{a}_{<t}, o_{<t})}{\int_\Theta P(x' \mid \theta', x, a) \, P(r \mid \theta', x, a, x') \, P(\theta' \mid \hat{a}_{<t}, o_{<t}) \, d\theta'} = \frac{P(r \mid \theta, x, a, x') \, P(\theta \mid \hat{a}_{<t}, o_{<t})}{\int_\Theta P(r \mid \theta', x, a, x') \, P(\theta' \mid \hat{a}_{<t}, o_{<t}) \, d\theta'}, \tag{11}$$

where we have replaced the sum by an integration
over $\Theta$, the finite-dimensional real space containing
only the average reward and the Q-values of the observed
states, and where we have simplified the term
$P(x' \mid \theta, x, a)$ because it is constant for all $\theta' \in \Theta$.

By inspection of Equation (11), one sees that $\theta$ encodes
a set of independent normal distributions over
the immediate reward having means $\xi(x, a, x')$ indexed
by triples $(x, a, x') \in \mathcal{X} \times \mathcal{A} \times \mathcal{X}$. In other words,
given $(x, a, x')$, the rewards are drawn from a normal
distribution with unknown mean $\xi(x, a, x')$ and
known variance $\sigma^2 = 1/p$. The sufficient statistics are given
by $n(x, a, x')$, the number of times that the transition
$x \to x'$ under action $a$ has occurred, and $\bar{r}(x, a, x')$, the
mean of the rewards obtained in the same transition. The
conjugate prior distribution is well known and given by a
normal distribution with hyperparameters $\mu_0$ and $\lambda_0$:

$$P(\xi(x, a, x')) = \mathcal{N}(\mu_0, 1/\lambda_0) = \sqrt{\frac{\lambda_0}{2\pi}} \exp\left\{ -\frac{\lambda_0}{2} \left( \xi(x, a, x') - \mu_0 \right)^2 \right\}. \tag{12}$$

The posterior distribution is given by

$$P(\xi(x, a, x') \mid \hat{a}_{\le t}, o_{\le t}) = \mathcal{N}(\mu(x, a, x'), 1/\lambda(x, a, x'))$$

where the posterior hyperparameters are computed as

$$\mu(x, a, x') = \frac{\lambda_0 \mu_0 + p \, n(x, a, x') \, \bar{r}(x, a, x')}{\lambda_0 + p \, n(x, a, x')}, \qquad \lambda(x, a, x') = \lambda_0 + p \, n(x, a, x'). \tag{13}$$



Finally, the conjugate distribution of the parameter
vector $\theta$ is simply the product

$$P(\theta \mid \hat{a}_{\le t}, o_{\le t}) = \prod_{x, a, x'} P(\xi(x, a, x') \mid \hat{a}_{\le t}, o_{\le t}) \propto \exp\left\{ -\frac{1}{2} \sum_{x, a, x'} \lambda(x, a, x') \left( \mu(x, a, x') - \xi(x, a, x') \right)^2 \right\} \tag{14}$$

because the $\xi(x, a, x')$ are independent but at the same
time functions of $\theta$ (Equation 8). Thus, the BCR-MDP
controller is fully specified by the action probabilities
in Equation (6), the likelihood models in Equations (7)
and (10), and the prior distribution (12).
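
The conjugate update in Equation (13) can be checked numerically. The sketch below, with hypothetical function names and default hyperparameters chosen only for illustration, computes the posterior hyperparameters of a single transition $(x, a, x')$ both in batch form and one reward at a time; the two should agree.

```python
def posterior_hyperparameters(rewards, mu0=1.0, lam0=1.0, precision=1.0):
    """Batch form of Equation (13) for the rewards seen on one transition."""
    n = len(rewards)
    if n == 0:
        return mu0, lam0  # no data: the posterior equals the prior
    r_bar = sum(rewards) / n
    lam = lam0 + precision * n
    mu = (lam0 * mu0 + precision * n * r_bar) / lam
    return mu, lam


def incremental_update(mu, lam, r, precision=1.0):
    """Equivalent one-observation update of the same hyperparameters."""
    new_lam = lam + precision
    new_mu = (lam * mu + precision * r) / new_lam
    return new_mu, new_lam
```

The incremental form is convenient on-line because it avoids storing the counts $n(x, a, x')$ and reward means $\bar{r}(x, a, x')$ separately: the pair $(\mu, \lambda)$ already summarizes them.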

4. Inference and Acting 

Inference can be carried out by sampling $\theta$ from the
posterior distribution in Equation (14). The actions
issued by BCR-MDP are by-products of the inference
process. Here we derive an approximate Gibbs sampler
for $\theta$. We introduce the following symbols: $\theta_{-\rho}$ and
$\theta_{-Q(x,a)}$ stand for the parameter set removing $\rho$ and
$Q(x, a)$ respectively; $\mu$ and $\lambda$ are matrices collecting
the values of the posterior hyperparameters $\mu(x, a, x')$
and $\lambda(x, a, x')$ respectively; and $M(x) := \max_a Q(x, a)$
is a shorthand.

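Substituting $\xi(x, a, x')$ in Equation (14) by its definition
(Equation 8) and conditioning on the Q-values,
we obtain the conditional distribution of $\rho$:

$$P(\rho \mid \theta_{-\rho}, \mu, \lambda) = \mathcal{N}(\bar{\rho}, 1/S) \tag{15}$$

where

$$\bar{\rho} = \frac{1}{S} \sum_{x, a, x'} \lambda(x, a, x') \left( \mu(x, a, x') - Q(x, a) + M(x') \right), \qquad S = \sum_{x, a, x'} \lambda(x, a, x').$$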

The conditional distribution over the Q-values is more
difficult to obtain, because each $Q(x, a)$ enters the
posterior distribution both linearly and non-linearly
(via the max operation) through $\xi$. However, if we fix
$Q(x, a)$ within the max operations, which amounts to
treating each $M(x)$ as a constant within a single Gibbs
step, then the conditional distribution can be
approximated by

$$P(Q(x, a) \mid \theta_{-Q(x,a)}, \lambda, \mu) \approx \mathcal{N}(\bar{Q}(x, a), 1/S(x, a)) \tag{16}$$

where

$$\bar{Q}(x, a) = \frac{1}{S(x, a)} \sum_{x'} \lambda(x, a, x') \left( \mu(x, a, x') - \rho + M(x') \right), \qquad S(x, a) = \sum_{x'} \lambda(x, a, x').$$

We expect this approximation to hold because the resulting
update rule constitutes a contraction operation
that forms the basis of most stochastic approximation
algorithms (Mahadevan, 1996). As a result, the Gibbs
sampler draws all the values from normal distributions.
In each cycle of the adaptive controller, one can carry
out several Gibbs sweeps to obtain a sample of $\theta$ to
improve the mixing of the Markov chain. However, our
experimental results have shown that a single Gibbs
sweep per state transition performs reasonably well.

Once a new parameter vector $\theta$ is drawn, BCR-MDP
proceeds by taking the optimal action given by Equation (6).
The resulting algorithm is listed in Algorithm 1.
Note that only the $\mu$ and $\lambda$ entries of the
transitions that have occurred need to be represented
explicitly; similarly, only the Q-values of visited states
need to be represented explicitly.



Algorithm 1 BCR-MDP Gibbs sampler.
  Initialize entries of $\theta$, $\lambda$ and $\mu$ to zero.
  Set initial state to $x \leftarrow x_0$.
  for $t = 1, 2, 3, \ldots$ do
    {Interaction}
    Set $a \leftarrow \arg\max_{a'} Q(x, a')$ and issue $a$.
    Obtain $o = (r, x')$ from plant.
    {Update hyperparameters}
    $\mu(x, a, x') \leftarrow \dfrac{\lambda(x, a, x') \, \mu(x, a, x') + p \, r}{\lambda(x, a, x') + p}$
    $\lambda(x, a, x') \leftarrow \lambda(x, a, x') + p$
    {Gibbs sweep}
    Sample $\rho$ using (15).
    for all $Q(y, b)$ of visited states do
      Sample $Q(y, b)$ using (16).
    end for
    Set $x \leftarrow x'$.
  end for
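
The listing above might be realized along the following lines. This is a sketch under several assumptions rather than the authors' code: the class and method names are hypothetical, states are arbitrary hashable keys, new transitions start from the prior hyperparameters `mu0` and `lam0` of Equation (12), and Q-values of state-action pairs that have never been tried are simply left at zero instead of being resampled.

```python
import random
from collections import defaultdict


class BCRMDP:
    """Sketch of Algorithm 1 with sparse storage for visited transitions."""

    def __init__(self, n_actions, precision=1.0, mu0=1.0, lam0=1.0, seed=None):
        self.n_actions = n_actions
        self.p = precision               # precision p of the reward noise
        self.mu0, self.lam0 = mu0, lam0  # prior hyperparameters (assumed defaults)
        self.rng = random.Random(seed)
        self.rho = 0.0
        self.Q = defaultdict(lambda: [0.0] * n_actions)  # zero until sampled
        self.mu = {}                     # posterior means, keyed by (x, a, x')
        self.lam = {}                    # posterior precisions, same keys

    def act(self, x):
        """Equation (6): greedy action under the current sample of theta."""
        q = self.Q[x]
        return max(range(self.n_actions), key=lambda a: q[a])

    def observe(self, x, a, r, x_next):
        """Hyperparameter update for (x, a, x'), then one Gibbs sweep."""
        key = (x, a, x_next)
        lam = self.lam.get(key, self.lam0)
        mu = self.mu.get(key, self.mu0)
        self.mu[key] = (lam * mu + self.p * r) / (lam + self.p)
        self.lam[key] = lam + self.p
        self.gibbs_sweep()

    def _M(self, x):
        return max(self.Q[x])

    def gibbs_sweep(self):
        # Equation (15): rho given the Q-values is N(rho_bar, 1/S).
        S = sum(self.lam.values())
        rho_bar = sum(l * (self.mu[k] - self.Q[k[0]][k[1]] + self._M(k[2]))
                      for k, l in self.lam.items()) / S
        self.rho = self.rng.gauss(rho_bar, S ** -0.5)
        # Equation (16): each tried Q(x, a) is N(Q_bar, 1/S(x, a)), M held fixed.
        successors = defaultdict(list)
        for (x, a, y) in self.lam:
            successors[(x, a)].append(y)
        for (x, a), ys in successors.items():
            S_xa = sum(self.lam[(x, a, y)] for y in ys)
            q_bar = sum(self.lam[(x, a, y)]
                        * (self.mu[(x, a, y)] - self.rho + self._M(y))
                        for y in ys) / S_xa
            self.Q[x][a] = self.rng.gauss(q_bar, S_xa ** -0.5)
```

A control loop would alternate `a = agent.act(x)` with `agent.observe(x, a, r, x_next)`. Because each sweep visits every recorded transition, the cost per step grows with the number of distinct transitions seen so far.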



5. Preliminary Empirical Results 

We have tested BCR-MDP on two toy examples: a
grid-world domain and a suite of randomly generated
MDPs. To give an intuition of the achieved
performance, the results are contrasted with those
achieved by R-learning. We have used the R-learning
variant presented in Singh (1994, Algorithm 3) together
with the uncertainty exploration strategy (Mahadevan,
1996). The corresponding update equations
are

$$Q(x, a) \leftarrow (1 - \alpha) \, Q(x, a) + \alpha \left( r - \rho + \max_{a'} Q(x', a') \right)$$
$$\rho \leftarrow (1 - \beta) \, \rho + \beta \left( r + \max_{a'} Q(x', a') - Q(x, a) \right), \tag{17}$$

where $\alpha, \beta > 0$ are learning rates. The exploration
strategy chooses with fixed probability $p_{\mathrm{exp}} > 0$ the
action $a$ that maximizes $Q(x, a) + \frac{C}{F(x, a)}$, where $C$ is a
constant, and $F(x, a)$ represents the number of times
that action $a$ has been tried in state $x$. Thus, higher
values of $C$ enforce increased exploration.
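
The baseline can be sketched as follows. Several details are assumptions of this example rather than statements from the text: both updates in Equation (17) use the value of $Q(x, a)$ from before the step, the agent acts greedily whenever it does not explore, and the counts `F[x][a]` are initialized to 1 so that the bonus is always defined. Function names are hypothetical.

```python
import random


def r_learning_step(Q, rho, x, a, r, x_next, alpha=0.5, beta=0.001):
    """Apply Equation (17) to one transition; updates Q in place, returns rho."""
    target = max(Q[x_next])
    q_old = Q[x][a]  # both updates use the pre-update value (assumption)
    Q[x][a] = (1 - alpha) * q_old + alpha * (r - rho + target)
    return (1 - beta) * rho + beta * (r + target - q_old)


def ue_action(Q, F, x, C, p_exp, rng=random):
    """Uncertainty exploration: with probability p_exp add the bonus C / F(x, a)."""
    actions = range(len(Q[x]))
    if rng.random() < p_exp:
        return max(actions, key=lambda a: Q[x][a] + C / F[x][a])
    return max(actions, key=lambda a: Q[x][a])  # greedy otherwise (assumption)
```

After each step the caller would increment `F[x][a]`, so the bonus for frequently tried actions decays and exploration gradually shifts to less visited ones.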

Grid-world domain. In Mahadevan (1996), a grid-world
is described that is especially useful as a test bed
for the analysis of RL algorithms. For our purposes,
it is of particular interest because it is easy to design
experiments containing suboptimal limit cycles.

Figure 1, panel (a), illustrates the 7x7 grid-world.
A controller has to learn a policy that leads it from
any initial location to the goal state. At each step,
the agent can move to any adjacent space (up, down,
left or right). If the agent reaches the goal state then
its next position is randomly set to any square of the
grid (with uniform probability) to start another trial.
There are also "one-way membranes" that allow the
agent to move in one direction but not in the other.
In these experiments, these membranes form "inverted
cups" that the agent can enter from any side but can
only leave through the bottom, playing the role of a
local maximum. Transitions are stochastic: the agent
moves to the correct square with probability $p = \frac{9}{10}$
and to any of the free adjacent spaces (uniform
distribution) with probability $1 - p = \frac{1}{10}$. Rewards are
assigned as follows. The default reward is $r = 0$. If
the agent traverses a membrane it obtains a reward of
$r = 1$. Reaching the goal state assigns $r = 2.5$.

[Figure 1 panels: (a) 7x7 Maze; (c) R-learning, C=5; (d) R-learning, C=30; (e) Average Reward versus time (x1000 time steps), with curves for BCR-MDP, R-learning C=5 and R-learning C=30.]

Figure 1. Results for the 7x7 grid-world domain. Panel (a) illustrates the setup. Columns (b)-(d) illustrate the behavioral
statistics of the algorithms. The upper and lower row have been calculated over the first and last 5,000 time steps of
randomly chosen runs. The probability of being in a state is color-encoded, and the arrows represent the most frequent
actions taken by the agents. Panel (e) presents the curves obtained by averaging ten runs.

The parameters chosen for this simulation were the
following. For BCR-MDP, we have chosen hyperparameters
$\mu_0 = 1$ and $\lambda_0 = 1$ and precision $p = 1$. For
R-learning, we have chosen learning rates $\alpha = 0.5$ and
$\beta = 0.001$, and the exploration constant has been set
to $C = 5$ and to $C = 30$.

A total of 10 runs were carried out for each algorithm.
The results are presented in Figure 1 and Table 1.
R-learning only learns the optimal policy given sufficient
exploration (panels c & d, bottom row), whereas
BCR-MDP learns the policy successfully. In Figure 1e, the
learning curve of R-learning is initially steeper than
that of the Bayesian controller. However, the latter attains
a higher average reward from around time step 125,000
onwards. We attribute this shallow initial transient to
the phase where the distribution over the operation
modes is flat, which is also reflected by the initially
random exploratory behavior.

Table 1. Average reward attained by the different algorithms
at the end of the run. The mean and the standard
deviation have been calculated based on 10 runs.

                        Average Reward
  BCR-MDP               0.3582 ± 0.0038
  R-learning, C = 30    0.3056 ± 0.0063
  R-learning, C = 5     0.2049 ± 0.0012

To test whether the performance of BCR-MDP scales
up with a larger problem, we have conducted a second
grid-world experiment where the number of
states has roughly been doubled. The results for this
10x10 maze are illustrated in Figure 2. The reward for
reaching the goal state has been set to $r = 10$ in this
case. The precision for this experiment has been set to
$p = 1/3$ to reflect higher uncertainty. This is still very
low given that the range of possible rewards is [0; 10].
We have simulated one run of one million time steps.
Again, one can see that the algorithm moves from a
highly exploratory phase to an exploitative phase
(Figure 2, left panels), eventually converging towards the
optimal policy. The learning curve shows a steady
increase in performance (Figure 2, right panel).
Interestingly, around time step 300,000 the curve shows an
abrupt change in slope. Presumably this is due to a
change of the belief state: the algorithm was exploiting
one of the two suboptimal limit cycles when it discovered
the optimal limit cycle. This confirms our intuition,
because the inference process cannot converge as
long as there is still uncertainty over the policy space.

Randomly generated MDPs. The purpose of this
experiment is to test the robustness of the algorithm
under different environments. In this second test bed,
random ergodic MDPs have been generated: a) with
10 states and 5 actions and b) with 20 states and 5 actions.
The transition and payoff matrices have been
constructed randomly: all transitions had non-zero
probabilities and all rewards took on values in [0; 1]. In
each run, a new MDP is generated and the three agents
are tested on it: BCR-MDP with precision $p = 1$;
R-learning with $\alpha = 0.5$, $\beta = 0.001$ and $C = 20$; and
policy iteration. The latter has been used to estimate
the maximum performance, i.e. the performance of an
informed agent. We have simulated a total of 30 runs
with 60,000 time steps for both cases and averaged the
curves. The results, presented in Figure 3 and Table 2,
show that BCR-MDP quickly approximates the optimal
average reward, in both cases significantly faster
than R-learning.

[Figure 3 panels: Average Reward (10-state 5-action MDPs); Average Reward (20-state 5-action MDPs). Legend entries recovered: Policy iteration; R-learning, C=20.]

Figure 3. Comparison of the average reward for BCR-MDP, policy iteration and R-learning. A total of 60 different MDPs
were randomly generated, one half having 10 states and 5 actions (left panel), and the other half having 20 states and
5 actions (right panel). The three algorithms were tested on these MDPs and their learning curves averaged.

Table 2. Average reward attained by the different algorithms
at the end of the run. The mean and the standard
deviation have been calculated based on 30 runs.

  Average Reward        10-state, 5-actions    20-state, 5-actions
  Policy iteration      0.7114 ± 0.0207        0.6906 ± 0.0101
  BCR-MDP               0.6998 ± 0.0211        0.6743 ± 0.0102
  R-learning, C = 20    0.6061 ± 0.0216        0.5677 ± 0.0104
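
A test MDP with the two stated properties (strictly positive transition probabilities and rewards in [0; 1]) could be generated as in the sketch below. The sampling distributions are not specified in the text, so the uniform weights, the lower bound of 0.05 and the function name are assumptions of this example.

```python
import random


def random_mdp(n_states, n_actions, seed=None):
    """Return (T, r) with T[x][a][y] > 0 for all entries and r[x][a] in [0, 1)."""
    rng = random.Random(seed)
    T, r = [], []
    for _ in range(n_states):
        T_x, r_x = [], []
        for _ in range(n_actions):
            # Strictly positive weights keep every transition possible, which
            # makes the chain ergodic under any stationary policy.
            w = [rng.uniform(0.05, 1.0) for _ in range(n_states)]
            total = sum(w)
            T_x.append([v / total for v in w])
            r_x.append(rng.random())
        T.append(T_x)
        r.append(r_x)
    return T, r
```

Calling `random_mdp(10, 5)` and `random_mdp(20, 5)` would produce instances of the two problem sizes used in this experiment.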



6. Summary and Conclusion 

The reformulation of the adaptive control problem 
as the minimization of the relative entropy over the 
causal dependencies stated in Equation (1) leads to 
an explicit solution given by the Bayesian control rule. 
This rule constitutes a general method to construct 
adaptive controllers from plant-specific controllers. Its 
main advantage is that it allows replacing the in- 
tractable calculation of the optimal policy for the class 
of plants by an on-line inference procedure, where ac- 
tions are simply by-products of the inference process. 
Conceptually, the Bayesian control rule instantiates 
several well-known ideas: the action selection strat- 
egy is a probability matching method (Wyatt, 1997); 
mixing task-optimal controllers is a mixture of experts 
technique (Jacobs et al., 1991); and minimizing the
relative entropy to design a controller is equivalent 
to maximizing the compression of the controller's I/O 
stream (MacKay, 2003). 

To illustrate the potential of the Bayesian control rule, 
we have derived BCR-MDP, an adaptive controller to 
solve undiscounted MDPs with finite state and ac- 
tion spaces and unknown dynamics. BCR-MDP is 
very simple to understand and to implement using the 
Gibbs sampler proposed in Section 4. Empirical re- 
sults show that the built-in exploration-exploitation 
strategy avoids getting trapped in local minima. 

Using Bayesian techniques in RL has a long history
and has been shown to be useful because they provide a
systematic way of incorporating prior knowledge and
domain assumptions into the problem and updating them
as more data are observed. This allows quantifying the 
uncertainty of the quantity of interest, e.g. the value 
function, action-value function, etc. The idea of re- 
stating RL as an inference problem has also been pro- 
posed in Toussaint et al. (2006). This approach uses 
the expectation-maximization (EM) algorithm to infer 
the optimal policy, and special pruning techniques to 
reduce the computational complexity. It is interesting 
to point out that in the case of undiscounted MDPs, 
Bayesian Q-learning (Dearden et al., 1998) resembles 
closely BCR-MDP. Our contribution is to show that 
such an algorithm can be derived from a more gen- 
eral relative entropy minimization principle, including 
some features like the implicit exploration-exploitation 
trade-off. 

We expect similar simplifications to hold for the de- 
sign of adaptive controllers for other classes of plant 
dynamics. In particular, potential applications of the 
Bayesian control rule include extensions to continu- 
ous state and action spaces and to partially observable 
Markov processes. 